A deep learning model predicts the presence of diverse cancer types using circulating tumor cells

Circulating tumor cells (CTCs) are cancer cells that detach from the primary tumor and intravasate into the bloodstream. Thus, non-invasive liquid biopsies are being used to analyze CTC-expressed genes to identify potential cancer biomarkers. In this regard, several studies have used gene expression changes in blood to predict the presence of CTCs and, consequently, cancer. However, CTC mRNA data have not been used to develop a generic approach that indicates the presence of multiple cancer types. In this study, we developed such a generic approach. Briefly, we designed two computational workflows, one using the raw mRNA data and deep learning (DL) and the other exploiting five hub gene ranking algorithms (Degree, Maximum Neighborhood Component, Betweenness Centrality, Closeness Centrality, and Stress Centrality) with machine learning (ML). Both workflows aim to determine the top genes that best distinguish cancer types based on the CTC mRNA data. We demonstrate that our automated, robust DL framework (DNNraw) more accurately indicates the presence of multiple cancer types using the CTC gene expression data than multiple ML approaches. The DL approach achieved an average precision of 0.9652, recall of 0.9640, F1-score of 0.9638 and overall accuracy of 0.9640. Furthermore, since we designed multiple approaches, we also provide a bioinformatics analysis of the genes commonly identified as top-ranked by the different methods. To our knowledge, this is the first study wherein a generic approach has been developed to predict the presence of multiple cancer types using raw CTC mRNA data, as opposed to other models that require a feature selection step.


Gene expression data
We downloaded CTC samples housed in ctcRbase 11. The CTC samples were from six cancer types, including breast cancer (BRCA), colorectal cancer (COAD), prostate cancer (PRAD), non-small cell lung cancer (LUSC), pancreatic cancer (PAAD), melanoma (SKCM) and liver cancer (LIHC) (see Table 1). We performed three preprocessing steps as described by Albaradei et al. 12. Note that the preprocessing involves a quality control assessment together with normalization techniques to standardize the data and address batch effects. Additionally, since the number of samples is imbalanced, we used the synthetic minority oversampling technique (SMOTE) to oversample the minority class using the imbalanced-learn Python library 13. Then, the data were split five times into 70% for training and 30% for testing. We also tested our models on three external datasets from Gene Expression Omnibus (GEO): GSE153514, which includes CTC samples from castration-resistant prostate cancer patients; GSE82198, which includes CTC samples from colon cancer patients; and GSE144561, which includes CTC samples from pancreatic cancer patients.
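The oversampling and repeated splitting described above can be illustrated with a minimal NumPy sketch. It is not the imbalanced-learn implementation used in the study: it only shows the core SMOTE idea (a synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours) and a plain, non-stratified 70/30 split repeated five times. The function names, the neighbour count `k`, and the seeds are assumptions made for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from a minority class (needs >= 2 rows)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class (toy sizes only).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)             # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)   # randomly chosen minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

def repeated_splits(n_samples, n_repeats=5, test_fraction=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random 70/30 splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n_samples * test_fraction))
    for _ in range(n_repeats):
        order = rng.permutation(n_samples)
        yield order[n_test:], order[:n_test]

# Toy minority class with three samples and two features (hypothetical values).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=4, k=2)
splits = list(repeated_splits(10))
```

Because every synthetic row lies on a segment between two real minority samples, it stays inside the region already occupied by that class.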

Features used by the ML and DL models
The ML prediction workflows generally include feature selection steps to avoid dealing with high-dimensional data 12. However, DL prediction models need no explicit feature selection step in their workflow. The neural network architecture learns features from the data and can capture non-linear relationships 14.

Using the PPI network to identify hub genes/features for the ML models
First, we used the GeneMANIA (Gene Ontology molecular function-based weighting) Cytoscape 3.6.0 plugin 15 to generate a physical protein-protein interaction (PPI) network. Then, we used the Cytoscape CytoHubba plugin to identify hub genes in the constructed PPI network using different local and global scoring techniques. A global technique considers the connection between the node and the entire network, while a local technique evaluates the relationship between the node and its immediate neighbors. We used five ranking algorithms to determine the hub genes. Two are local ranking algorithms: Degree, which counts the number of adjacent nodes, and Maximum Neighborhood Component (MNC), which calculates the size of the maximum connected component among a node's neighbors. Three are global ranking algorithms: Betweenness Centrality (BC), which measures how often a node lies on the shortest paths between other nodes; Closeness Centrality, which calculates how short the shortest paths are from a node to all other nodes; and Stress Centrality, which calculates the absolute number of shortest paths passing through a node. Genes were ranked based on these five scoring algorithms, and the top 100 hub genes from each ranking method were shortlisted and subsequently used to develop ML models.
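To make the two local scores concrete, the following pure-Python sketch computes Degree and MNC on a toy undirected graph. It is an illustration only, not the CytoHubba code: the node names and edges are hypothetical, and MNC is implemented here as the size of the largest connected component of the subgraph induced by a node's neighbors.

```python
from collections import deque

def degree(adj, v):
    """Number of nodes adjacent to v."""
    return len(adj[v])

def mnc(adj, v):
    """Largest connected component size in the subgraph induced by v's neighbours."""
    nbrs = set(adj[v])
    seen, best = set(), 0
    for start in nbrs:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first search among neighbours only
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w in nbrs and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

# Hypothetical toy network: A is linked to B, C and D; B-C and D-E are also linked.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Rank nodes by MNC (ties broken alphabetically) and keep the top two.
ranked_by_mnc = sorted(adj, key=lambda g: (-mnc(adj, g), g))
top_two = ranked_by_mnc[:2]
```

Node A has three neighbors (Degree 3), but only B and C are connected to each other, so its MNC is 2. The example shows how the two local scores can rank the same node differently.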

Using DeepLIFT to identify genes/features for the DL model
We used Deep Learning Important FeaTures (DeepLIFT) 16, a feature scoring algorithm, to calculate the contribution score of each neuron (gene) in the input layer of the DL model. DeepLIFT calculates a contribution score for every gene of each input sample. The obtained contribution scores express the importance of the corresponding genes for the output (prediction) layer. Then, we ranked the genes based on their importance scores and selected the top 100 ranked genes for further analyses.
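The idea of per-sample contribution scores can be illustrated in the simplest possible setting. For a purely linear output, a DeepLIFT-style contribution of input i is the weight times the difference between the input and a reference input, and the contributions sum to the difference between the output and the reference output. The sketch below shows only this linear special case; it is not the DeepLIFT library and does not reproduce the rules used for non-linear layers. The all-zero reference, the aggregation by mean absolute contribution, and the gene names are assumptions.

```python
import numpy as np

def linear_contributions(w, X, reference):
    """Per-sample, per-feature contributions for a linear output y = X @ w + b."""
    return (X - reference) * w

def rank_features(contrib, names, top_k):
    # Assumption: aggregate per-sample scores by their mean absolute value.
    importance = np.abs(contrib).mean(axis=0)
    order = np.argsort(-importance)[:top_k]
    return [names[i] for i in order]

names = ["gene_a", "gene_b", "gene_c"]      # hypothetical gene names
w = np.array([2.0, -0.5, 0.0])              # hypothetical learned weights
X = np.array([[1.0, 4.0, 3.0],
              [3.0, 0.0, 1.0]])             # two toy samples
reference = np.zeros(3)                     # all-zero reference input (assumption)

contrib = linear_contributions(w, X, reference)
top_genes = rank_features(contrib, names, top_k=2)
```

Each row of `contrib` sums to that sample's output minus the reference output, which is the property that makes the scores interpretable as a decomposition of the prediction.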

Developing ML and DL models
We created a parameter search space to evaluate different configurations for the Support Vector Machine (SVM), Random Forest (RF), k-nearest neighbor (KNN) and Deep Neural Network (DNN) models (see Table 2). We implemented the ML models, SVM, RF, and KNN, using the Scikit-learn Python library 17. For the SVM SVC class, we employed the standard parameters, i.e., a radial basis function kernel with degree = 3 and gamma = auto. We also implemented an RF model with 100 trees in the forest and a max depth of 32. We implemented the KNN model with the KNeighborsClassifier class and set the number of nearest neighbors to 5.
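A minimal Scikit-learn sketch of the three classifier configurations stated above is given below. Only the parameters named in the text are set; everything else is left at the library defaults, and the random seed and the synthetic toy data are assumptions added so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def build_ml_models(seed=0):
    """Return the three classifiers with the settings given in the text."""
    return {
        # `degree` only affects the polynomial kernel; it is listed to mirror the text.
        "SVM": SVC(kernel="rbf", degree=3, gamma="auto"),
        "RF": RandomForestClassifier(n_estimators=100, max_depth=32,
                                     random_state=seed),  # seed is an assumption
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }

# Synthetic stand-in for a samples-by-genes matrix (not the study data).
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
models = build_ml_models()
for name, model in models.items():
    model.fit(X[:84], y[:84])                 # 70% of the toy data for training
    print(name, round(model.score(X[84:], y[84:]), 3))
```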
For the DL model, we implemented a DNN that has three hidden layers with 7000, 3000, and 500 nodes using the Python Keras library (https://github.com/fchollet/keras). We employed the SGD algorithm with the default parameters as the optimizer and used cross-entropy to compute the loss between actual and predicted labels. We set the number of epochs to 100 and the batch size to 8. We used the early stopping and dropout (with a drop rate of 0.3) techniques to avoid overfitting.
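The architecture described above could be sketched in Keras roughly as follows. The layer sizes, dropout rate, optimizer, loss, epochs, and batch size come from the text; the ReLU and softmax activations, the placement of dropout after each hidden layer, the early-stopping patience, and the validation split are assumptions, since the text does not specify them. A small helper that counts the dense-layer parameters is included because it runs without Keras installed.

```python
HIDDEN_UNITS = (7000, 3000, 500)   # hidden layer sizes from the text
DROP_RATE = 0.3                    # dropout rate from the text

def count_dense_parameters(n_inputs, n_classes, hidden=HIDDEN_UNITS):
    """Weights plus biases of the fully connected stack."""
    sizes = (n_inputs, *hidden, n_classes)
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

def build_dnn(n_inputs, n_classes):
    """Build and compile the network; Keras is imported only when this is called."""
    from tensorflow import keras
    layers = [keras.Input(shape=(n_inputs,))]
    for units in HIDDEN_UNITS:
        layers.append(keras.layers.Dense(units, activation="relu"))   # assumption
        layers.append(keras.layers.Dropout(DROP_RATE))                # placement assumed
    layers.append(keras.layers.Dense(n_classes, activation="softmax"))
    model = keras.Sequential(layers)
    model.compile(optimizer="sgd", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training call (not executed here); patience and validation_split are assumptions:
# stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
#                                      restore_best_weights=True)
# model.fit(X_train, y_train_onehot, epochs=100, batch_size=8,
#           validation_split=0.1, callbacks=[stop])

small_example = count_dense_parameters(10, 2, hidden=(4, 3))
```

The parameter count of the first layer grows linearly with the number of input genes, which is why feeding the entire gene set makes the first dense layer by far the most input-dependent part of the model.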
The analysis compared two gene lists. The first list comprises the 66 genes commonly identified by the five topological ranking algorithms. The second list includes the 25 genes commonly identified by the five topological ranking algorithms and the DL method. Enriched terms were considered statistically significant at an adjusted P-value < 0.01.
We also used miRNet 19 (accessible at https://www.mirnet.ca/miRNet/home.xhtml) to determine the critical set of microRNAs associated with the 66 genes commonly identified as top-ranked by the multiple ranking algorithms used in this study. Note that we did not repeat this process for the 25 commonly identified genes, as the 25 genes are a subset of the 66 commonly identified genes.

Results and discussion
The study design
The workflow of our study incorporates six main steps, as depicted in Fig. 1. First, we collated 58,347 genes from 481 CTC samples retrieved from the ctcRbase 11 database, accessed in December 2022 (Table 1 provides the statistics of these datasets); we preprocessed these data and applied SMOTE to create an integrated dataset, which we split into training and testing sets. Second, we used the integrated data for two objectives: (1) to identify the top 100 hub genes/features to be fed to the ML models using five graph ranking algorithms, and (2) as features (i.e., the entire gene set) to train the DL model. Third, we built and evaluated the ML/DL models using the features described in the previous step. Fourth, we tested our best models using independent datasets. Fifth, we mined the essential genes by determining the commonly identified features/gene set and utilized ML to evaluate the impact of these genes in the sample classification process. Sixth, we performed bioinformatics analyses on the gene set.

Evaluating the prediction performances of the ML and DL models
We evaluated the changes in the prediction performances of the ML models (SVM, RF and KNN) when fed the top 100 features (hub genes) determined by the five ranking algorithms, and of the DL (DNN) model when fed the raw mRNA data directly. Briefly, we used the 18,790 genes to construct a PPI network using GeneMANIA. As a preprocessing step, we removed all nodes (genes) with no connected edges, which resulted in a network consisting of 15,660 nodes (i.e., genes) and 159,560 edges (i.e., direct physical PPIs). We fed this network into Cytoscape to visualize it and determine the hub genes using the cytoHubba plugin. Then, we obtained the 100 top-ranked hub genes for five topological ranking algorithms, namely Degree, Betweenness Centrality (BC), Maximum Neighborhood Component (MNC), Closeness Centrality, and Stress Centrality (see Supplementary Table).

Prediction performances of models when fed hub genes determined by ranking algorithms
We developed ML models (SVM, RF, and KNN) using features (hub genes) determined by the five ranking algorithms separately (see Table 3). Briefly, we first used one of the ranking algorithms to determine the top 100 ranked hub genes. Then, we trained and tested the SVM, RF, and KNN classifiers by feeding them the top 100 ranked genes. We repeated the training and testing five times using different training and testing splits and calculated various metric scores on each test set. Finally, we aggregated the results by averaging the metric scores on the test data. We performed the same procedure for all ranking algorithms.
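The aggregation step can be sketched in plain Python: compute precision, recall, and F1 on each test split, then average each metric across the splits. The text does not say how per-class scores were combined, so the macro-averaging used here is an assumption, and the labels and predictions are toy values rather than study results.

```python
from statistics import mean

def precision_recall_f1(y_true, y_pred, label):
    """One-vs-rest precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_scores(y_true, y_pred):
    """Unweighted mean of the per-class scores (macro-averaging is an assumption)."""
    per_label = [precision_recall_f1(y_true, y_pred, l) for l in sorted(set(y_true))]
    return tuple(mean(col) for col in zip(*per_label))

def average_over_splits(results):
    """results: one (precision, recall, f1) tuple per test split."""
    return tuple(mean(col) for col in zip(*results))

# Two toy test splits (the study used five).
split_1 = macro_scores(["BRCA", "BRCA", "PRAD", "PRAD"],
                       ["BRCA", "PRAD", "PRAD", "PRAD"])
split_2 = macro_scores(["BRCA", "PRAD"], ["BRCA", "PRAD"])
avg_precision, avg_recall, avg_f1 = average_over_splits([split_1, split_2])
```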
Table 3 provides the prediction performances of the ML models fed the hub genes as features. The results show that the RF classifier achieved the best result consistently, followed by the SVM classifier, for all five sets of features determined by the ranking algorithms. The RF classifier achieved the best and second-best prediction performances with F1-scores (the harmonic mean of precision and recall) of 0.9424 and 0.9349 using the MNC and BC top 100 ranked hub genes, respectively. Similarly, the SVM classifier's best and second-best prediction performances were achieved with the BC (F1-score of 0.9086) and MNC (F1-score of 0.8978) top-ranked hub genes, as were those of the worst-performing classifier, KNN. Thus, BC (global ranking algorithm) and MNC (local ranking algorithm) appear to be the better ranking algorithms, followed closely by Degree, while Stress and Closeness Centrality generally produced the worst performances for all the models.
Figure 1. The study workflow, which consists of six main steps. First, the data are collected. Then, the data are used to identify the top 100 hub genes/features through graph ranking algorithms for the ML models, as well as to train a DL model. Next, the ML/DL models are built and evaluated and then tested with independent datasets. We then mine essential genes by analyzing commonly identified features/gene sets and assessing their impact using ML. Finally, we perform bioinformatics analyses on the gene set.

Prediction performance of the DL model when fed the raw mRNA data directly
When using the DL model, DNN, we achieved an average precision of 0.9652, recall of 0.9640, F1-score of 0.9638 and overall accuracy of 0.9640 (see Fig. 2). The DNN performs better (around 2% higher) than the best ML model (RF). This result suggests that the way the DNN model learns allows it to better focus on the informative mRNA features, which gives the model improved generic capabilities, i.e., the ability to predict the origin of the tumor cell among different primary sites. We therefore also applied DeepLIFT to calculate importance scores for each gene, which we ranked to select the top 100 ranked genes. The DNN model's prediction performance with these top 100 ranked genes was only around 7% lower than the prediction performance using the entire raw mRNA data set, suggesting that these genes are the key contributors to the DNN model's performance. Moreover, even though we observe a drop in the DNN model's performance using the top 100 ranked genes, this result is still on par with the ML models' performances.

Evaluating the prediction performances of the ML and DL models using independent test data
To further assess the robustness of our best-constructed models, RF and DNN, we tested these models on three independent datasets (GSE153514, GSE82198, and GSE144561, see Table 1). The RF models assessed include those built with the top 100 ranked hub genes determined by the best local ranking algorithm, MNC, and the best global ranking algorithm, BC. The RF/MNC model performed better than the RF/BC model (see Fig. 3).
The RF/MNC model achieved F1-scores of 0.6667 (GSE153514; 6 out of 9 samples were classified correctly as prostate cancer and 3 were misclassified as colorectal cancer), 0.6667 (GSE82198; 2 out of 3 samples were classified correctly and 1 was misclassified as breast cancer) and 0.7647 (GSE144561; 13 out of 17 samples were classified correctly as pancreatic cancer, 2 were misclassified as colorectal cancer and 2 as breast cancer) for the independent testing datasets. The RF/BC model achieved similar F1-scores of 0.6667 (GSE153514; 6 out of 9 samples were classified correctly as prostate cancer and 3 were misclassified as colorectal cancer), 0.6667 (GSE82198; 2 out of 3 samples were classified correctly and 1 was misclassified as breast cancer) and 0.7059 (GSE144561; 11 out of 17 samples were classified correctly as pancreatic cancer, 3 were misclassified as breast cancer, 2 as melanoma and 1 as NSCLC), but the misclassifications were different. We also assessed the DNN model built with the entire raw mRNA data set (DNNraw) and the DNN model built with the top 100 ranked genes determined by DeepLIFT (DNNdeeplift). DNNraw achieved slightly better performances than DNNdeeplift and both RF models, with F1-scores of 0.7776 (GSE153514; 7 out of 9 samples were classified correctly as prostate cancer and 2 were misclassified as colorectal [...]
Despite the strength of the DNNraw model, the 100 top-ranked genes represented by DNNdeeplift do not achieve better prediction performance than the RF/MNC and RF/BC models.
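As a worked check on the RF/MNC figures reported for the independent datasets, the sketch below computes a micro-averaged F1-score from the stated classification counts. In single-label multiclass classification, every misclassified sample counts once as a false positive and once as a false negative, so micro-averaged F1 reduces to the fraction of correctly classified samples. The reported values of 0.6667 and 0.7647 coincide with 6/9 and 13/17, but whether micro-averaging was the averaging actually used in the study is an assumption.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    labels = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == p for t, p in pairs)
    fp = sum(p == l and t != l for l in labels for t, p in pairs)
    fn = sum(t == l and p != l for l in labels for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

# RF/MNC on GSE153514: 6 of 9 prostate samples correct, 3 called colorectal.
gse153514 = micro_f1(["PRAD"] * 9, ["PRAD"] * 6 + ["COAD"] * 3)

# RF/MNC on GSE144561: 13 of 17 pancreatic samples correct, 2 colorectal, 2 breast.
gse144561 = micro_f1(["PAAD"] * 17, ["PAAD"] * 13 + ["COAD"] * 2 + ["BRCA"] * 2)

print(round(gse153514, 4), round(gse144561, 4))   # prints 0.6667 0.7647
```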

Identifying the influential genes using data mining techniques
The prediction performances of the RF/MNC and RF/BC models show that the best local ranking algorithm, MNC, and the best global ranking algorithm, BC, do not focus on the most influential genes very effectively. Thus, we further considered whether a gene being commonly identified as top-ranked by all the ranking algorithms increases the likelihood that it is an influential gene.
Determining the influential genes based on their contribution to the prediction performances
Here, we identified the set of genes commonly identified as top-ranked hub genes by all five ranking algorithms (Degree, BC, MNC, Closeness Centrality, and Stress Centrality). Approximately two-thirds of the genes (66 genes) were commonly identified by all five ranking algorithms. Furthermore, since we also used DeepLIFT to calculate the importance scores of each gene used in the DNNraw model to identify the 100 top-ranked genes, we also determined the set of genes commonly identified by the five ranking algorithms and DeepLIFT. We found that approximately one-quarter of the genes (25 genes) were commonly identified by all five ranking algorithms and DeepLIFT.
To assess whether these are the influential genes, we compared the prediction performances of the best-performing DNN, SVM, RF, and KNN models with those of DNN, SVM, RF and KNN models built using the 66 commonly identified top-ranked genes, and of models built using the 25 commonly identified top-ranked genes (see Fig. 4). For the models built using the 66 commonly identified top-ranked genes, the RF model continues to outperform the SVM and KNN models. Moreover, the RF model built using the 66 genes achieved an F1-score of 0.9404, almost identical to the RF/MNC model's performance (F1-score of 0.9424). The DL model built with the 66 genes also slightly outperforms the DNNdeeplift model, with F1-scores of 0.9167 and 0.8945, respectively. These results show that the 66 commonly identified top-ranked genes produce prediction performances nearly identical to those obtained with the 100 top-ranked genes, which suggests that the 66 genes are the influential genes. This finding is further substantiated by the loss in performance observed for the models built using the 25 commonly identified top-ranked genes. Nonetheless, since the loss in performance of the models constructed using the 25 genes only ranges between 0.0144 and 0.0987, this, too, shows the substantial impact of the 25 genes.
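Finding the commonly identified genes amounts to intersecting the top-ranked lists. The sketch below shows this with short toy lists; the gene identifiers (G1, G2, ...) are hypothetical placeholders and the lists hold four genes each rather than 100.

```python
def common_genes(*ranked_lists):
    """Genes present in every one of the given top-ranked lists."""
    common = set(ranked_lists[0])
    for genes in ranked_lists[1:]:
        common &= set(genes)
    return sorted(common)

# Hypothetical top-ranked lists from the five ranking algorithms.
top_ranked = {
    "Degree":    ["G1", "G2", "G3", "G4"],
    "MNC":       ["G2", "G1", "G4", "G5"],
    "BC":        ["G1", "G4", "G2", "G6"],
    "Closeness": ["G4", "G1", "G2", "G7"],
    "Stress":    ["G1", "G2", "G4", "G8"],
}
deeplift_top = ["G2", "G9", "G4", "G10"]     # hypothetical DeepLIFT ranking

hub_common = common_genes(*top_ranked.values())
hub_and_dl_common = common_genes(*top_ranked.values(), deeplift_top)
```

By construction, the second set is always a subset of the first, which mirrors the relationship between the 25 and the 66 genes in the study.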

Bioinformatics analyses of the commonly identified top-ranked genes
We further conducted an enrichment study focused on the commonly identified top-ranked genes. Table 4 lists the top 10 GO terms associated with the 66 hub genes commonly identified by the five ranking algorithms as top-ranked. The terms were related to body size, embryonic lethality, abnormal cell cycle, decreased fibroblast proliferation, and decreased immature B cell number for the MGI Mammalian phenotype database; regulation of the apoptotic process, DNA damage response, and protein modification for the GO biological process database; and cancer pathway, thyroid hormone signaling pathway, PI3K-Akt signaling pathway, and Estrogen signaling pathway for the KEGG database. No significant terms were detected using the GWAS catalog database. Of the 66 hub genes, 24 genes function in the 'regulation of apoptotic process'; 22 genes in 'decreased body size' and 'negative regulation of the apoptotic process'; and 20 genes in 'pathways in cancer'. The top significant term across the four databases used in this analysis relates to 'embryonic lethality (MP:0011096)' with an adjusted P-value of 1.7e−16. Considering the KEGG database, the top significant term is 'cancer pathways' with an adjusted P-value of 4e−17. Additionally, 5 of the top 10 significant terms for the KEGG database are cancer pathway related, including 'endometrial cancer', 'breast cancer', 'prostate cancer', 'proteoglycans in cancer', and 'pathways in cancer'. Table 5 provides the top 20 genes involved in cancer pathways based on enrichment analysis using the KEGG database. In the Supplementary Material, we provide complete information on the enrichment analysis results, including the bar plots for enrichment analysis and the top 20 significant terms detected from each database.
Figure 4. The column chart compares prediction performances among the best-performing ML and DL methods built with the 66 and 25 commonly identified top-ranked genes separately. For the models built using the 66 commonly identified genes, the RF model consistently outperforms the SVM and KNN models. Additionally, the DL model constructed with the 66 genes slightly outperforms the DNNdeeplift model. Despite a decrease in performance when using the 25 top-ranked genes, the loss in performance ranges from only 0.0144 to 0.0987, which highlights the substantial impact of these 25 genes as well.
We also conducted GO enrichment for the 25 genes commonly identified by the five ranking algorithms and DeepLIFT. Table 6 lists the top 10 GO terms associated with the 25 genes. As for the 66 genes, the enriched terms include terms related to cancer and pathways, such as 'Bladder cancer', 'Breast cancer', 'Transcriptional misregulation in cancer', 'MicroRNAs in cancer', and 'PI3K-Akt signaling pathway'. However, for the 25 genes, terms related to infection, such as 'Epstein-Barr virus infection' and 'Kaposi sarcoma-associated herpesvirus infection', are also enriched. This is interesting, as studies have shown that infections can lead to uncontrolled metastasis in mammalian cells by activating various signaling cascades 40-42. For example, Lee et al. 40 demonstrated the downregulation of the epithelial tight junction protein E-cadherin in gastric cancer cells infected with H. pylori cytotoxin-associated gene A (CagA). GSK-3, which induces the degradation of oncogenic proteins such as Snail, c-Myc, and Mcl-1, is also reduced with CagA infection. These results showed that CagA infection enables the transcriptional repressor Snail to suppress E-cadherin, which leads to EMT and metastasis. They also used the chorioallantoic membrane (CAM) assay to show that CagA induces non-invasive MCF-7 cells to exhibit in vivo invasive progression 40. Chow et al. 41 showed that non-small cell lung cancer cells infected with E. coli also exhibit increased cell adhesion, migration and metastasis via TLR4 signaling. Moreover, Wynendaele et al. 42 demonstrated that bacterial quorum sensing peptides activate the Ras/Raf/MEK/MAPK, PI3K/Akt, and STAT intracellular signaling cascades in mammalian cells. They further showed that a bacterial quorum sensing peptide upregulates HIST1H4, and observed EGFR hyperphosphorylation and activation of Smad2/Smad3 proteins linked with cell cytoskeleton rearrangement and cell migration. These results indicate that infection leads to genetic alterations and cancer metastasis through several signaling cascades, which include the 'PI3K-Akt signaling pathway', another term enriched for the 25 genes.
Table 4. Enrichment analyses showing the top 10 significant GO terms associated with the 66 hub genes commonly identified as top-ranked by five ranking algorithms.
To further determine the key microRNAs associated with the 66 and 25 commonly identified genes, we used miRNet 19. For the 66 genes, we used a betweenness filter of 14,800 to obtain the top 10 miRNAs. Subsequently, we used the 'Function Explorer' in miRNet to obtain the diseases, functions, and clusters significantly associated with the identified miRNAs.

Concluding remarks
The detection and analysis of CTCs offer invaluable real-time insights into tumor evolution. CTCs serve as a blood-based biomarker for early tumor diagnosis, disease recurrence, and metastatic spread, and also as a possible avenue for gauging therapeutic response and developing personalized medicine. However, there are several challenges in CTC data analysis. CTCs are rare, with a frequency of one CTC per billion normal blood cells 56. They also have a short half-life 57. CTCs originating from different cancer types vary significantly in size, seeding potential, and cell surface marker expression 58. Enumerating CTCs is an arduous task prone to user bias, but it holds prognostic value, and the additional characterization of these cells can provide clinically relevant and treatment-specific insight. On the other hand, ML techniques, compared to traditional statistical analysis, offer objectivity, rapid execution, the ability to overcome noise, flexibility, and reduced human intervention in analyzing CTC data. Using DL on gene expression can provide insights into tumor biology and improve our understanding of cancer biology. It can help identify key genes and pathways that are altered in different cancer types, which could reveal new targets for drug development.
This study used CTC samples from six cancer types: breast, colorectal, prostate, non-small cell lung, pancreatic, melanoma, and liver cancer to build ML and DL models that we tested on three external Gene Expression Omnibus (GEO) datasets. Features were identified for both the ML and DL prediction workflows. For the ML workflow, GeneMANIA was used to generate a physical protein-protein interaction (PPI) network, and the top 100 hub genes were ranked using each of the five ranking algorithms. For the DL workflow, DeepLIFT was used to identify genes by calculating contribution scores for each neuron in the input layer. The top hub genes chosen by the five ranking algorithms were used to create ML models (SVM, RF, and KNN). The RF classifier consistently produced the best results, with the SVM classifier in second place. The MNC and BC top 100 ranked hub genes provided the best and second-best prediction results, respectively.
On the other hand, the Deep Neural Network model achieved an average precision of 0.9652, recall of 0.9640, F1-score of 0.9638, and overall accuracy of 0.9640. Thus, it offered improved generic capabilities and performed better than the best-performing ML model. We further assessed the robustness of the best-constructed RF and DNN models using three independent datasets. The RF/MNC and RF/BC models achieved acceptable prediction performances, with F1-scores ranging from 0.6667 to 0.7647 across the three datasets. The DNN models, constructed from the whole raw mRNA data set (DNNraw) and the top 100 genes as determined by DeepLIFT (DNNdeeplift), also achieved acceptable prediction performances. However, the DNNraw model performed better than the DNNdeeplift, RF/MNC, and RF/BC models. It is important to note that despite the strength of the DNNraw model, the 100 top-ranked genes represented by DNNdeeplift did not achieve better prediction performance than the RF/MNC and RF/BC models.
Enrichment analysis was performed on the hub genes, which showed that they were significantly involved in body size, embryonic lethality, abnormal cell cycle, decreased fibroblast proliferation, decreased immature B cell number, and cancer-related pathways such as bladder cancer, breast cancer, transcriptional misregulation, microRNAs, and the PI3K-Akt signaling pathway, as revealed by GO analysis. The enrichment of the PI3K-Akt signaling pathway is commonly observed in many human cancers, including breast, lung, ovarian, and prostate cancer. However, the activation time of this pathway varies among cancer types and patients. These findings underscore the crucial role of PI3K-Akt-related genes in classifying metastatic tumor cells 59. Moreover, GO terms related to infection, such as Epstein-Barr virus infection and Kaposi sarcoma-associated herpesvirus infection, were also enriched. Studies have shown that infections can lead to uncontrolled metastasis in mammalian cells by activating various signaling cascades. For example, CagA infection downregulates E-cadherin and GSK-3, the latter of which induces the degradation of oncogenic proteins, leading to EMT and metastasis. Bacterial quorum sensing peptides activate intracellular signaling cascades, upregulate HIST1H4, and induce EGFR hyperphosphorylation. These findings indicate that infection leads to genetic alterations and cancer metastasis through various signaling cascades, and the fact that our models picked up this signal suggests that preventing infection in cancer patients may be key to preventing cancer progression to metastasis.
Despite the potential advantages of using DL on gene expression using cfDNA, this approach has several limitations.One major challenge is the lack of standardization in collecting, processing, and analyzing cfDNA samples, leading to significant variability between different studies.Therefore, establishing standards and protocols for sample collection, processing, and analysis is necessary.Another area for improvement is that more sensitive and precise analytical techniques are required to ensure the most minuscule amounts of cfDNA in the blood are detectable.Another challenge is the dependence of DL models on existing data for training, and there needs to be more diverse and representative datasets for cfDNA analysis.Datasets should be large and diverse enough to include multiple cancer types, cancer subtypes, and different stages of cancer for the development of robust DL models.
Our model overcomes one of these limitations through the use of raw unprocessed data, and in future work, we intend to integrate multi-omics datasets such as proteomic, epigenetic, and transcriptomic data with DL models to enable innovative biomarker discovery. https://doi.org/10.1038/s41598-023-47805-2

Figure 2. Column chart depicting the prediction performance of the DL model using (1) the entire raw mRNA data set and (2) the top 100 ranked genes determined by DeepLIFT. It is evident from the image that the DL model's prediction performance using the top 100 ranked genes is only approximately 7% lower than the performance achieved using the entire raw mRNA dataset. This similarity suggests that these selected genes play a crucial role in contributing to the overall performance of the DL model.

Figure 3. Column chart illustrating the prediction performances of the best-constructed models, RF (RF/MNC and RF/BC) and DNN (DNNraw and DNNdeeplift), on three independent datasets (GSE153514, GSE82198, and GSE144561, see Table 1). It is evident from the chart that the RF models constructed using the top 100 ranked hub genes determined by the MNC local ranking algorithm outperformed the RF models built with the BC global ranking algorithm. Furthermore, DNNraw achieved slightly better performances than both DNNdeeplift and the RF models across the datasets.

Figure 5. Network generated by miRNet. It shows 10 important miRNAs, represented by blue squares, that are predicted to target the 66 hub genes (represented by pink circles) commonly identified by the five ranking algorithms as top-ranked.

Table 1. Statistics of the training and testing data.

Table 2. Parameter search space for optimizing SVM, RF, KNN, and DNN models. Best parameters are in bold.

Table 3. The prediction performances of SVM, RF, and KNN when fed the top 100 hub genes determined by five ranking algorithms. The bold and italic results indicate each ranking algorithm's best and second-best performing models.

Table 5. The top 20 genes from among the 66 hub genes involved in cancer pathways based on enrichment analysis using the KEGG database.

Table 6. Enrichment analyses showing the top 10 significant GO terms associated with the 25 genes commonly identified by the five ranking algorithms and DeepLIFT as top-ranked.